T-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome

Authors

  • Yu Peng
  • Francis Y.L. Chin
Abstract

RNA sequencing based on next-generation sequencing technology is useful for analyzing transcriptomes, discovering novel genes and studying exon/intron structures. Like de novo genome assembly, de novo transcriptome assembly does not rely on a reference genome or additional annotation. Most, if not all, existing de novo transcriptome assemblers rely heavily on de novo genome assembly techniques without fully exploiting the properties of transcriptomes. They may therefore produce short contigs, both because genes are spliced (isoforms share exons) and because repeats occur in different genes. In this paper, we analyze the properties of the mammalian transcriptome and propose an algorithm to reconstruct expressed isoforms without a reference genome. We extend the iterative de Bruijn graph approach (IDBA) by using paired-end information to solve the problem of long repeats in different genes and the problem of branching within the same gene due to alternative splicing. The graph is decomposed into small components, each of which corresponds to a few genes, if not a single gene. The most probable isoforms with sufficient support from the paired-end reads are then found heuristically by depth-first search. In practice, our de novo transcriptome assembler, T-IDBA, substantially outperforms ABySS (one of the newest de novo transcriptome assemblers) in terms of sensitivity and precision on both simulated and real data. The experimental results also match our theoretical analysis of the performance of T-IDBA, which guarantees that most isoforms can be reconstructed as long as their coverage exceeds a certain threshold. Availability: T-IDBA is available at http://www.cs.hku.hk/~alse/idba/
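To make the idea in the abstract concrete, the following is a minimal Python sketch of how alternative splicing creates a branch in a de Bruijn graph and how a depth-first search can enumerate candidate isoforms that are then scored by read-pair support. It is not the T-IDBA implementation: the function names (`build_de_bruijn`, `enumerate_paths`, `pair_support`), the toy sequences, and the support rule are all assumptions made for illustration, and the sketch omits the iteration over k, error removal, component decomposition, and strand handling of real paired-end reads.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, start, max_len=1000):
    """Depth-first enumeration of the sequences spelled by paths from `start`.

    A path ends when it reaches a node with no unvisited successor or when
    the spelled sequence reaches `max_len` characters.
    """
    results = []

    def dfs(node, seq, seen):
        successors = [n for n in sorted(graph.get(node, ())) if n not in seen]
        if not successors or len(seq) >= max_len:
            results.append(seq)
            return
        for nxt in successors:
            dfs(nxt, seq + nxt[-1], seen | {nxt})

    dfs(start, start, {start})
    return results


def pair_support(candidate, pairs):
    """Count read pairs whose two mates both occur in `candidate`, left mate first.

    Simplification (assumption): mates are given on the same strand and the
    insert size is not checked.
    """
    count = 0
    for left, right in pairs:
        i = candidate.find(left)
        if i != -1 and candidate.find(right, i + len(left)) != -1:
            count += 1
    return count


# Two toy isoforms of one hypothetical gene: the second skips the middle exon.
ISOFORM_LONG = "ATGGCGTACCTGAA"
ISOFORM_SHORT = "ATGGCCTGAA"


def tile(seq, read_len=6):
    """Error-free reads covering every position of `seq` (toy data)."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]


reads = tile(ISOFORM_LONG) + tile(ISOFORM_SHORT)
graph = build_de_bruijn(reads, k=4)
candidates = enumerate_paths(graph, "ATG")

# A read pair whose left mate spans the junction into the middle exon
# supports only the long isoform.
pairs = [("GGCG", "CCTG")]
scores = {c: pair_support(c, pairs) for c in candidates}
```

In this toy graph the node `GGC` has two successors, one per splice variant, and both branches rejoin at `CCT`. The depth-first search recovers both isoforms, and the single read pair distinguishes the long one from the short one.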


Similar resources

T-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome - (Extended Abstract)

RNA sequencing based on next-generation sequencing technology is useful for analyzing transcriptomes, discovering novel genes and studying exon/intron structures. Similar to genome assembly, de novo transcriptome assembly does not rely on a reference genome and additional annotated information. Most, if not all, existing de novo transcriptome assemblers rely heavily on de novo genome assembly t...


De Bruijn Graph based De novo Genome Assembly

Next Generation Sequencing (NGS) is an important process that makes it inexpensive to organize vast raw sequence data sets compared with traditional sequencing systems or methods. Various aspects of NGS, such as template preparation, sequencing imaging, and genome alignment and assembly, outline genome sequencing and alignment. Consequently, the de Bruijn graph (dBG) is an important mathema...


IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler

The de Bruijn graph assembly approach breaks reads into k-mers before assembling them into contigs. The string graph approach forms contigs by connecting two reads with k or more overlapping nucleotides. Both approaches must deal with the following problems: false-positive vertices, due to erroneous reads; gap problem, due to non-uniform coverage; branching problem, due to erroneous reads and r...


IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels

Motivation: RNA sequencing based on next-generation sequencing technology is effective for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on any reference genome or additional annotation information, but is more difficult. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which make it very difficult to identify low...


IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth

Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing...




Publication date: 2013